Hi everyone! For my first Medium post I wanted to do something fun and interesting, and to provide a real-world example of the power of

1 Abstract

Background: The on-going COVID-19 pandemic has taken more than 4 million human lives worldwide since it started on 2020. Novel vaccines were developed at the end of 2020 to reduce the mortality rate associated with this respiratory infectious disease. There has been a different level of vaccine adoption among countries, so the key question this report seeks to answer is whether countries with higher vaccination rates are associated with lower COVID-19 case fatality rate.

Method: Using country and monthly case, death and vaccination data related to COVID-19, a generalized linear model with a negative binomial distribution was fitted, to estimate the effect of vaccination at country level in reducing the case fatality rate (CFR) of COVID-19. The model was controlled by each country heterogeneity and the time development of the COVID-19 disease.

Results: A statistical significant relationship was found, higher vaccination rates in a country reduces the relative risk of COVID-19. In comparison to the group with 0%-25% vaccinated people, countries in the 25%-50% group have a 19% [C.I. 95%: -27%;-11%] lower CFR for COVID-19, countries in the 50%-75% group have a 27% lower CFR [-35%;-18%] and countries in the 75%-100% group have a 37% lower CFR [-48%;-24%]. Additionally, a 10% increase in the percentage of vaccinated people results in a reduction of the COVID-19 CFR of 6.56% [C.I. 95%: -8.48%;-4.61%]. Potential limitations of the analysis are due to the potential presence of time and spatial autocorrelation in the data.

Conclusion: A negative association between higher vaccination rates and lower CFR for COVID-19 was found. Countries with current low vaccination rates should prioritize a policy to vaccinate their population with the goal to reduce the death toll of the COVID-19 pandemic.

2 Introduction

The on-going COVID-19 pandemic has taken more than 4 million human lives worldwide since it started on 2020. Novel vaccines were developed at the end of 2020 to reduce the mortality rate associated with this respiratory infectious disease. There has been a different level of vaccine adoption among countries, so the key question to answer is whether country with higher vaccination rates among their population are associated with lower COVID-19 case fatality rates (CFR).

MENCIONAR ESTUDIO RCT EXISTENTES QE MUESTRAN LA EFICACIA DE VACUNAS COVID

2.1 Available Data

As an international organization with the mission to improve human health, the World Health Organization (WHO) has been compiling daily country level data for the COVID-19 pandemic. This data is easily accessible to the general public (link to download). The data contains COVID-19 daily related cases and deaths for a variety of countries. Here are the resolutions:

  • Temporal resolution: Day, from 2020 (January) to 2022 (present)
  • Geographical re solution: Country (237 unique) and world region (7 in total)
  • Outcome to study:
    • Case Fatality Rate (CFR): Measurement of amount of patients with COVID-19 that died (\(CFR=\frac{Deaths}{Cases}\))

MEJORAR TEXTO: mas explicito

We are interested in exploring the effect of vaccines on the case fatality rate for COVID-19. Our World in Data is an organization that compiles country level data from multiple sources to make research more easy. They have a made a strong effort to compile a database with the total number of people fully vaccinated (people who received all doses prescribed by the vaccination protocol) by date and country. The vaccination data is available here, with a public GitHub.

For this project the variable to evaluate vaccination status was the total people fully vaccinated per 100 people in the country, or the percentage of people fully vaccinated. This metric was chosen for several reasons:

  • It is already normalized by total population in each country.
  • Only considers people that have completely fulfill the requirements for the vaccine administered, whether is 1 or 2 doses required.
  • The vaccine type was not considered in the scope of this analysis, but it is a factor that could be considered in future studies.

3 Background

The virus causes an inflammatory respiratory stress on the human body. In the two years of the pandemic, approximately 5 million people have died, and more than 400 million have been infected. Several countries have taken different methods to mitigate the dispersion and impact of the pandemic, such as mobility restrictions (quarantines), social distancing measures, mask enforcement. A little more than a year ago (end of 2020) novel vaccines against COVID-19 were developed, which have proven useful to reduce the death rate due to COVID-19.

Several studies have found a negative association between vaccination and case fatality rate of COVID-19 (Liang et al., 2021; Passarelli-Araujo et al., 2022). In particular, Liang et al. (2021) found that a 10% increase in the vaccination among the population decrease the CFR of COVI-19 by 7.6%. This study uses a panel data consisting of 90 countries over 25 weeks, fitting a country-level random effects model (Liang et al., 2021). The variable for vaccination level was the number of people per 100 habitants that have received at least one dose of the vaccine. The question remains open to study the effect of people fully vaccinated (scheme completed), which this report seeks to answer.

4 Method

4.1 Data preparation

Both datasets (WHO and OurWorldinData) were merged based in the country name, resulting in a match of 167 countries in total. For these countries we got reliable information on COVID-19 cases and deaths, and vaccination status for a given number of months, that may differ according to the start date of vaccination campaigns in each country.

As the COVID-19 disease usually takes two weeks to fully develop into a mortal disease, a two week shift in the death data (bring backward) was considered to align cases with deaths in the estimation of the monthly CFR for each country.

To improve the estimation and robustness of the analysis, the CFR was estimated for each country only in the months in which there were at least 100 cases and more than 10 COVID-19 deaths.

PQ MES: AGREGAR DATOS DE MUERTES/COVID Y ESTADO VACUNACION

4.2 Statistical Model

MENCIONAR QUE SURVIVAL ANALYSIS NO APLICA

The time series data for COVID-19 cases and deaths, and vaccination rates per country was converted to a panel data: that is a cross-sectional ecological study (main observational unit is the monthly variables per country). CFR and vaccination rates were summarized to monthly averages to balance the trade-off between number of observations and accuracy of predictions. A monthly average allow us to have approximately 10 observations per country, and by averaging for the whole month we are reducing the random daily noise.

The model to be analyzed will be a generalized linear model, using panel data (ecological cross-sectional with a time series):

\[ log(Y_{i,t})=\beta_0+\beta_1 (Vaccination \: status)_{i,t}+\beta_{2,i} (Country)_i+\beta_{3,t} (Month)_t+\epsilon_{i,t} \]

Where:

  • i: Index for each specific country: 167 in total.
  • t: Index of month: December 2020 to July 2022.
  • \(Y_{i,t}\) is the case fatality rate per country “i” on month “t”. The CFR is calculated by dividing the deaths for the cases that ocurred in the country two weeks before.
  • \(\beta_1\) is the effects of each vaccination level on the case fatality rate. The model will be fitted considering vaccination level as a categorical variable. The average percentage of people fully vaccinate in a given month is classified into 4 groups: [0-25%; 25%-50%; 50%-75%; 75%-100%]. The data is divided into 4 groups to capture the differences across major vaccination levels. Additional sensitivity analysis is done by fitting a model with 10 vaccination level categories and with vaccination as a numeric variable.
  • \(\beta_{2,i}\) is the specific effect of the pandemic in each country
    • Note that other potential confounding variables for each country don’t need to be included, as the \(\beta_2\) coefficient capture the particularities (or heterogeneity) for each country.
  • \(\beta_{3,t}\) is the time effect of the pandemic, representing the development of new variants of COVID-19, advances in medicine or treatment or any other variable that may have a time effect on the case fatality rate for COVID-19.

The outcome variable in the model corresponds to number of deaths normalized by total cases of COVID-19 (the case fatality rate). The number of deaths were transformed using a negative binomial distribution with a logarithmic link function, as the negative binomial is a better fit for count data than the Poisson when there is over-dispersion, as it allows the variance to be different from the mean. The negative binomial distribution have been used in other studies for count data with over-dispersion, such as deaths (Haldar & Sethi, 2020).

Each observation in the model is weighted in the regression using the total number of cases per month, which helps to resolve the issue of differences in size between countries. An analysis of the distribution of the outcome variable and the appropriateness for a logarithmic transformation is provided in the Appendix section.

The major assumptions of the model are:

  • The outcome variable follows a negative binomial distribution.
  • The model is correctly specified (no omitted variable): No other characteristics, beside vaccination, among countries have changed in the last year that could affect the CFR for COVID-19.
  • The data used reflects the COVID-19 pandemic in each country: that is that each country has the institutions to accurately identify cases, deaths and vaccinated people for COVID-19.

The main hypothesis to test is the effect of each vaccination level on the CFR for COVID-19, that is to test \(H_0:\beta_1=0\) vs \(H_A:\beta_1 \neq 0\). A likelihood ratio test is conduct to test whether the coefficient associated to vaccination status is significant. Furthermore, the coefficients are present along with their confidence intervals at 95% level. This model specification can only provide an answer regarding association of vaccination rates and CFR at country level, and does not provide an statement about causal inference nor efficacy of the vaccine at the individual level.

To interpret the results, the coefficients are presented as relative risk (RR). This represents the relative change in the case fatality rate for COVID-19. For example, a coefficient with a RR of 0.9, means that this variable reduces in 10% the mortality rate for COVID-19, while controlling by other variables in the model.

4.2.1 Sensitivity analysis

All tests on the residuals of the generalized linear model were conducted using a simulation-based approach using the DHARMa library in R (Hartig, 2022). The following test were conducted: dispersion of residuals and a simulated QQ-plot. Results from this test are presented in the appendix.

In addition to testing the main model, several scenarios were conducted to analyze results in more robust way:

  • Model with vaccination as a numeric variable.
  • Remove observations with outliers values for CFR above 10% in a given month.
  • Test a model with more categories for vaccination level variable: [0%-10%, 10%-20%, …, 90%-100%].
  • Fit mixed generalized linear model with countries as a random effects variable, such as the study of Liang et al. (2021).
  • Use bootstrap methods to estimate the confidence intervals for the original model.

5 Results

# libraries
library(tidyverse) # data manipulation
library(scales) 
library(lubridate) #time format
library(gridExtra) # plot manipulation
library(plotly) # interactive plots
library(skimr) # summary statistics
library(flextable) #table format for hhtml
library(DHARMa) # Simulations test for GLM
theme_set(theme_bw(16)+theme(panel.grid.major = element_blank()))
# COVID-19 Data -----
# load WHO COVID-19 data
covid <- read_csv("https://covid19.who.int/WHO-COVID-19-global-data.csv")

# Data wrangling (preparation for analysis)

# shift deaths two weeks early
covid <- covid %>% 
  group_by(Country) %>% 
  mutate(deaths_shift=lead(New_deaths,14,
                          order_by = Date_reported)) %>% 
  ungroup()


# filter to remove March 2022 - Updated to August
covid <- covid %>% filter(Date_reported<"2022-08-01") 

## Create time factor variable, splitting years in half
halfYear_levels <- c("2020-1","2020-2","2021-1",
                     "2021-2","2022-1","2022-2")
month_levels <- c("2020-12","2021-1","2021-2","2021-3",
                  "2021-4","2021-5","2021-6","2021-7","2021-8",
                  "2021-9","2021-10","2021-11","2021-12",
                  "2022-1","2022-2","2022-3","2022-4","2022-5",
                  "2022-6","2022-7")

covid <- covid %>%  
  mutate(period=case_when(
    Date_reported < as.Date("2020-07-01") ~ "2020-1",
    Date_reported < as.Date("2021-01-01") ~ "2020-2",
    Date_reported < as.Date("2021-07-01") ~ "2021-1",
    Date_reported < as.Date("2022-01-01") ~ "2021-2",
    Date_reported < as.Date("2022-07-01") ~ "2022-1",
    Date_reported < as.Date("2023-01-01") ~ "2022-2",
    T ~ "other") %>% factor(levels=halfYear_levels),
    month=paste(year(Date_reported),month(Date_reported),sep="-") %>% 
      factor(levels =month_levels))


# clean data: remove negative cases and deaths
covid <- covid %>% 
  mutate(New_deaths=deaths_shift) %>% 
  filter(New_cases>=0 & New_deaths>=0)

## need to summarize COVID data to periods of time: MONTHS
covid_period <- covid %>% 
  group_by(WHO_region,Country,month) %>% 
  summarise(cases=sum(New_cases),
            deaths=sum(New_deaths)) %>% ungroup() %>% 
  filter(cases>99 & deaths>9) %>%  # filter to remove unreliable periods
  mutate(cfr=deaths/cases)  # we calculate CFR for each given month

# VACCINATION DATA -----
## Load vaccination data
vaccination <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv")

# Filter to remove March 2022 - Updated to August 2022
vaccination <- vaccination %>% 
  filter(date<"2022-08-01")

# We summarized vaccination data based on the a time period: MONTH
## Create time factor variable, splitting years in half
vaccination <- vaccination %>% 
  mutate(period=case_when(
    date < as.Date("2020-07-01") ~ "2020-1",
    date < as.Date("2021-01-01") ~ "2020-2",
    date < as.Date("2021-07-01") ~ "2021-1",
    date < as.Date("2022-01-01") ~ "2021-2",
    date < as.Date("2022-07-01") ~ "2022-1",
    date < as.Date("2023-01-01") ~ "2022-2",
    T ~ "other") %>% factor(levels=halfYear_levels),
    month=paste(year(date),month(date),sep="-") %>% factor(levels=month_levels))

# take the average over the period: NA is 0
vaccination_period <- vaccination %>% 
  mutate(vaccinated=replace_na(people_fully_vaccinated_per_hundred,0)) %>% 
  group_by(location,month) %>% 
  summarise(vaccinated=mean(vaccinated,na.rm=T)) %>% ungroup() 

# JOIN DATA -----

## Change countries name to increase matching (manually, based on visual inspection)
covid <- covid %>% 
  mutate(Country=case_when(
    Country=="Bolivia (Plurinational State of)" ~ "Bolivia",
    Country=="Bonaire" ~ "Bonaire Sint Eustatius and Saba",
    Country=="Brunei Darussalam" ~ "Brunei",
    Country=="Cabo Verde" ~ "Cape Verde",
    Country=="Côte d’Ivoire" ~ "Cote d'Ivoire",
    # str_detect(Country,"C.te d.Ivoire") ~ "Cote d'Ivoire", 
    Country=="Curaçao" ~ "Curacao",
    # str_detect(Country,"Cura.ao") ~ "Curacao",
    Country=="Democratic Republic of the Congo" ~ "Democratic Republic of Congo",
    Country=="Falkland Islands (Malvinas)" ~ "Falkland Islands",
    Country=="Faroe Islands" ~ "Faeroe Islands",
    Country=="Iran (Islamic Republic of)" ~ "Iran",
    Country=="Kosovo[1]" ~ "Kosovo",
    Country=="Lao People's Democratic Republic" ~ "Laos",
    Country=="occupied Palestinian territory, including east Jerusalem" ~ "Palestine",
    Country=="Pitcairn Islands" ~ "Pitcairn",
    Country=="Republic of Korea" ~ "South Korea",
    Country=="Republic of Moldova" ~ "Moldova",
    Country=="Russian Federation" ~ "Russia",
    Country=="Sint Maarten" ~ "Sint Maarten (Dutch part)",
    Country=="Syrian Arab Republic" ~ "Syria",
    Country=="The United Kingdom" ~ "United Kingdom",
    Country=="Timor-Leste" ~ "Timor",
    Country=="United Republic of Tanzania" ~ "Tanzania",
    Country=="United States of America" ~ "United States",
    Country=="Venezuela (Bolivarian Republic of)" ~ "Venezuela",
    Country=="Viet Nam" ~ "Vietnam",
    T ~ Country))

# Join countries from both data sources
# countries_covid <- unique(covid$Country)
# countries_vaccine <- unique(vaccination$location)
# 
# countries_vaccine <- data.frame(countries_vaccine=countries_vaccine,
#                              x=1)
# countries_covid <- data.frame(countries_covid=countries_covid,
#                               y=1)

# are the codes similar?
# countries_vaccine %>%
#   left_join(countries_covid,by=c("countries_vaccine"="countries_covid")) %>%
#   pull(y) %>% sum(na.rm=T)
# 215 countries match


# join based on location - I join it to the Vaccination, so I can get months with vaccine information for each country
df <- vaccination_period %>% 
  left_join(covid_period, by=c("location"="Country",
                               "month")) %>% 
  filter(cases>=0 & deaths>=0 & !(is.na(vaccinated)))

## More convenient labels for region
df <- df %>% 
  mutate(region_name=case_when(
    WHO_region=="EMRO" ~ "Eastern Mediterranean",
    WHO_region=="EURO" ~ "Europe",
    WHO_region=="AFRO" ~ "Africa",
    WHO_region=="AMRO" ~ "Americas",
    WHO_region=="WPRO" ~ "Western Pacific",
    WHO_region=="SEARO" ~ "South-East Asia") %>%
      factor(levels=c("Americas","Europe","Eastern Mediterranean",
                      "Western Pacific","South-East Asia","Africa")),
  location=as.factor(location))

# Vaccination level factor variable
df <- df %>% 
  mutate(vaccinated_level=case_when(
    vaccinated<25 ~ "0%-25%",
    vaccinated<50 ~ "25%-50%",
    vaccinated<75 ~ "50%-75%",
    vaccinated<100 ~ "75%-100%",
    T ~"other") %>% factor(levels=c("0%-25%","25%-50%",
                                    "50%-75%","75%-100%")),
    vaccinated_level_10=case_when(
    vaccinated<10 ~ "0%-10%",
    vaccinated<20 ~ "10%-20%",
    vaccinated<30 ~ "20%-30%",
    vaccinated<40 ~ "30%-40%",
    vaccinated<50 ~ "40%-50%",
    vaccinated<60 ~ "50%-60%",
    vaccinated<70 ~ "60%-70%",
    vaccinated<80 ~ "70%-80%",
    vaccinated<90 ~ "80%-90%",
    vaccinated<101 ~ "90%-100%",
    T ~"other") %>% factor(levels=c("0%-10%","10%-20%","20%-30%",
                                    "30%-40%","40%-50%","50%-60%",
                                    "60%-70%","70%-80%","80%-90%",
                                    "90%-100%")))
# df$vaccinated_level_10 %>% table()

5.1 Descriptive analysis

5.1.1 COVID-19

Let’s provide some summary statistics for the COVID-19 cases and deaths data.

# function to estimate summary statistics:
f.sum <- function(x){
  return(
    data.frame(
     Mean=mean(x,na.rm=T),
     sd=sd(x,na.rm=T),
     Min=min(x,na.rm=T),
     Median=quantile(x,0.5,na.rm=T),
     Max=max(x,na.rm = T)
    )
  )
}

# average number of months per county:
avg_country_month <- df %>% 
  group_by(location) %>% tally()

# Table with summary statistics
rbind(`Number of months per country`=f.sum(avg_country_month$n),
      `COVID-19 Cases per Month`=f.sum(df$cases),
      `COVID-19 Deaths per Month`=f.sum(df$deaths),
      `Case Fatality Rate (CFR) %`=f.sum(df$cfr*100),
      `People Fully Vaccinated (%)`=f.sum(df$vaccinated)) %>% 
  rownames_to_column() %>% rename(Metric=rowname) %>% 
  flextable() %>% autofit() %>% 
  colformat_double(digits=2) %>% 
  colformat_double(i=2:3,digits=0) %>% 
  bold(part = "header") %>% 
  set_caption(paste0(
    "Summary Metrics: Monthly observations per country. n = ",nrow(df))) %>% 
  footnote(i=1,j=1,
           as_paragraph(paste0(length(unique(df$location)),
                               " Countries considered")))
rm(avg_country_month)

We observe:

  • For the 167 countries considered, we have an average number of months with an estimation for the CFR of 10.20. This number will probably be sufficient to draw inferences on the effect of vaccination level on CFR, while controlling for each country characteristics.
  • The CFR has a mean value of 2.12%, with a large variance. The range of the metric is also high, with countries with values of CFR above 75% in a given month.
  • We observe also a high dispersion in the vaccination level across monthly country observation, with some countries reaching more than 90 people fully vaccinated per 100 people.
#### Time series of COVID-19 Cases and deaths
p_region <- df %>% 
  pivot_longer(c(cases,deaths), names_to = "key", values_to = "value") %>% 
  ggplot(aes(month,value))+
  # geom_line()+
  geom_boxplot()+
  facet_wrap(~key,ncol=1,scales = "free_y")+
  labs(x="Date",y="")+
  scale_y_continuous(labels=function(x) format(x, 
                                               big.mark = " ", 
                                               scientific = FALSE),
                     trans = "log10")
# p_region


p_cfr_time <- df %>% 
  ggplot(aes(month,cfr))+
  geom_boxplot()+
   scale_y_continuous(labels = scales::percent,
                      limits = c(0,0.1))+
  labs(x="Date",y="Case Fatality Rate COVID-19 [%]")+
  theme(axis.text.x = element_text(angle =-45,vjust=-0.5))
# p_cfr_time

p_vacc_time <- df %>% 
  ggplot(aes(month,vaccinated))+
  geom_boxplot()+
  labs(x="Date",y="People Fully Vaccinated per 100 people")+
  theme(axis.text.x = element_text(angle =-45,vjust=-0.5))
# p_vacc_time

## combine graphs
# grid.arrange(p_cfr_time,p_vacc_time,ncol=1)

5.1.1.1 Case Mortality Rate COVID-19

Our main outcome of interest is the case fatality rate for COVID-19. Let’s look at the total case fatality rate for COVID-19 for the entire pandemic period:

LOS PEAKS POR REGION SE PODRAN EXPLICAR POR PEAKS DE PAISES?

CURVA VERDE TIENE UN PATRON INTERESANTE

p_cfr_all<- df %>% 
  filter(WHO_region!="Other") %>% 
  ggplot(aes(region_name,cfr))+
  geom_jitter(alpha=0.5,aes(text=paste0("Location: ",location,
                                        "\n Month: ",month)))+
  geom_boxplot(alpha=0.5,outlier.alpha = 0)+
  coord_flip()+
  scale_y_continuous(labels = scales::percent, limits=c(0,0.1))+
  labs(x="WHO Region",y="Case Fatality Rate COVID-19 [%]",
       caption = "Cumulated cases and deaths for the period December 2020 to July 2022 \n Only CFR below 10% are shown")

ggplotly(p_cfr_all)
# percentage above 10%
# sum(nrow(filter(df,cfr>0.1)))/nrow(df)*100

We can see that the case fatality rate is around 2% for the majority of the observations gathered, that correspond at monthly rates for each country. We observe variations across each country and across WHO regions, with Europe and Western Pacific with the lowest CFR.

Let’s look at the case fatality rate over time:

# Time series CFR - NEED TO CONSIDER WEIGHTED AVERAGE
p_cfr_time <- df %>% 
  group_by(month,region_name) %>%
  summarize(cfr=mean(cfr,na.rm=T)) %>%
  ggplot(aes(month,cfr,
             col=region_name,
             group=region_name
             ))+
  geom_line(size=1.5)+
  # geom_boxplot()+
  scale_y_continuous(labels = scales::percent,
                     limits = c(0,0.1),
                     # breaks=c(0,0.05,0.1)
                     )+
  theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
  labs(x="",y="Case Fatality Rate COVID-19 [%]",col="WHO Region")

ggplotly(p_cfr_time)

We observe a higher value in the CFR in the early times of the pandemic, and an important reduction in the most recent times. We also observe important peaks and differences across WHO regions, probably due to different variations of the COVID-19 virus.

5.1.2 Vaccination Data

The next figure shows the vaccination status for each country at July 2022. We observe important differences in the access to vaccines across WHO regions, and within countries in the same region (each point in the figure represents a country). For example, Europe present a high vaccination rate, but we observe important differences in the vaccination level within the countries of Europe. The distribution of vaccination levels seems to be two-modal, with some countries with high vaccination and some with really low.

Further analysis is included in the appendix session, which shows a clear bi-modal distribution in every WHO region regarding vaccination level, except for Africa that has only low vaccinated countries.

p_vacc_all <- df %>% 
  filter(month=="2022-7") %>% #July 2022 
  ggplot(aes(region_name,vaccinated))+
  geom_boxplot(alpha=0.5,outlier.alpha = 0)+
  geom_jitter(alpha=0.5, size=2,
              aes(text=paste0("Location: ",location)))+
  coord_flip()+
  labs(x="WHO Region",y="People Fully Vaccinated [%]",
       caption = "Vaccination status at July 2022")

ggplotly(p_vacc_all)
# time series of vaccination - NEED TO CONSIDER WEIGHTED AVERAGE
# df %>% 
#   group_by(month,region_name) %>%
#   summarize(vaccinated=mean(vaccinated,na.rm=T)) %>%
#   ggplot(aes(month,vaccinated,
#              col=region_name,
#              group=region_name
#              ))+
#   geom_line()+
#   # geom_boxplot()+
#   theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
#   labs(x="",y="People Fully Vaccinated per 100 people",col="WHO Region")

5.1.3 CFR vs Vaccination Rates

Let’s make a comparison between the percentage of vaccinated people and the monthly CFR for each country. This comparison directly relates to our question of interest. The following boxplot shows the case fatality rate over different levels of vaccination, divided by WHO regions:

PQ SE OBSERVE UN EFECTO EN DISTINTA DIRECCION????

PUEDE SER RUIDO, EFECTO BORDE CATEGORIA, COMO TRATAMOS LOS OUTLIERS

EXPLICAR MEJOR EL GRAFICO, QE ES CADA COSA. QUIZAS UN JITTER GRAPH ES MEJOR!

p_vac_cfr_box <- df %>% 
  filter(!(is.na(WHO_region))) %>% 
  ggplot(aes(region_name,cfr,col=vaccinated_level))+
  geom_boxplot()+
  scale_y_continuous(labels = scales::percent,
                     limits = c(0,0.1)
  )+
  coord_flip()+
  # theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
  labs(col="Percentage of People \n Fully Vaccinated [%]",
       x="WHO Region",
       y="Case Fatality Rate COVID-19 [%]",
       caption="Each point represents a country summary statistic for each month")

# ggplotly(p_vac_cfr_box)
p_vac_cfr_box

We observe for each region a clear reduction in the CFR for COVID-19 associated with higher vaccination levels. This reduction occur both in the average value and in the variation across countries in each month. This reduction occurs in higher intensity with vaccination rates above 50%. We also observe that some regions are left unvaccinated, like Africa.

This figure seems to support our main hypothesis of reduction of CFR in countries with higher vaccination rates. Let’s construct statistical model to test this hypothesis formally.

5.2 Inferential analysis

The following table shows the fitted coefficients for the vaccination level, along with the confidence intervals at 95% level. Diagnostics of this model are presented in the appendix section, but overall they don’t support the evidence of over-dispersion, but there is some presence of outliers in the model.

We observe that countries with vaccination rates between 25%-50% have a 19% reduction in the CFR for COVID-19, compared to countries in the 0%-25% vaccination group. This reduction increases with higher vaccination rates, as the reduction for the group 50%-75% is 27% and for the group 75%-100% is 37%.

# Model
df_model <- df %>% 
  select(deaths,cfr,cases,vaccinated,vaccinated_level,
  vaccinated_level_10,location,month) %>% 
  mutate(location_month=paste0(month,location)) %>% 
  na.omit()

# Model with categorical vaccinated level
model.nb2 <- MASS::glm.nb(deaths~vaccinated_level+
                            location+month+offset(log(cases)), 
              data = df_model) 
# summary(model.nb2)
point_vac <- exp(model.nb2$coefficients[2:4])
ci_vac <- exp(confint(model.nb2,2:4, level=0.95))

# LR test
model.nb1 <- MASS::glm.nb(deaths~location+month+offset(log(cases)), 
              data = df_model) 
test_lr <- lmtest::lrtest(model.nb1,model.nb2)
estimates <- data.frame(mid=point_vac,
                        lower_bound=ci_vac[,1],
                        upper_bound=ci_vac[,2])
names(estimates) <- c("Estimated Coefficient",
                      "Lower Bound C.I. 95%",
                      "Upper Bound C.I. 95%")

estimates <- estimates %>% 
  rownames_to_column() %>% 
  rename(Variable=rowname) %>% 
  mutate(Variable=str_replace(Variable,"vaccinated_level",
                              "Vaccination Level: "))


estimates %>%
  rename(RR=`Estimated Coefficient`) %>% 
  flextable() %>% 
  autofit() %>% 
  colformat_double(digits=2) %>% 
  set_caption("Relative Risk (RR): Vaccination Level")
estimates %>% 
  mutate(Variable=str_remove(Variable,"Vaccination Level: ")) %>% 
  ggplot(aes(Variable, `Estimated Coefficient`))+
  geom_point(col="red")+
  geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
                    ymax=`Upper Bound C.I. 95%`))+
  coord_flip()+
  geom_hline(yintercept = 1, linetype="dashed")+
  labs(x="Vaccination Level",y="Relative Risk (RR): Vaccination Level")

rm(estimates)


# Function to plot estimated coefficients along their error bar
f_plot_coef <- function(mod,
                        range=2:4,
                      var_name="vaccinated_level",
                      var_title="Vaccination Level",
                      title_plot=""){
  
  point <- exp(mod$coefficients[range])
  ci <- exp(confint(mod,range, level=0.95))
  estimates <- data.frame(mid=point,
                          lower_bound=ci[,1],
                          upper_bound=ci[,2])
  
  names(estimates) <- c("Estimated Coefficient",
                        "Lower Bound C.I. 95%",
                        "Upper Bound C.I. 95%")
  
  estimates <- estimates %>% 
    rownames_to_column() %>% 
    rename(Variable=rowname) %>% 
    mutate(Variable=str_replace(Variable,var_name,
                                paste0(var_title,": ")))
  
  # Plot
  estimates %>% 
    mutate(Variable=str_remove(Variable,paste0(var_title,": "))) %>% 
    ggplot(aes(Variable, `Estimated Coefficient`))+
    geom_point(col="red")+
    geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
                      ymax=`Upper Bound C.I. 95%`))+
    coord_flip()+
    geom_hline(yintercept = 1, linetype="dashed")+
    labs(x=var_title, y=paste0("Relative Risk (RR): ",
                               var_title),title = title_plot)
}

The estimated coefficients represent the relative change with respect to the reference case (Vaccination level between 0% and 25%). We can see important reduction associated with higher vaccination levels, meaning that countries with higher vaccination levels have a lower CFR, an it is an increasing relationship. Also, as the C.I. don’t cross the reference line at 1, we can reject the null hypothesis of no effect of vaccination level on CFR at 95% level. The likelihood ratio test also rejected the null hypothesis of no vaccination level effect with a p-value below \(1.45*10^{-8}\).

Let’s present also a summary of the other fitted coefficients estimated for each country and for each month:

location_coefficients <- data.frame(coef=exp(model.nb2$coefficients)) %>% 
  rownames_to_column() %>% 
  filter(str_detect(rowname,"location")) %>% 
  mutate(location=str_remove(rowname,"location"))

# Plot with WHO region
country_region <- df %>% 
  group_by(region_name,location) %>% tally() %>% ungroup()

location_coefficients %>% 
  left_join(country_region) %>% 
  ggplot(aes(region_name,coef,col=region_name))+
  geom_jitter(alpha=.5)+
  geom_boxplot()+
  geom_hline(yintercept = 1, linetype="dashed")+
  coord_flip()+
  guides(col=F)+
  labs(x="WHO Region",col="",y="Relative Risk (RR): Country")

size_coef <- length(model.nb2$coefficients)

# get coefficients for month
point_month <- exp(model.nb2$coefficients[(size_coef-13):(size_coef)])
ci_month <- exp(confint(model.nb2,(size_coef-13):(size_coef), level=0.95))

estimates_month <- data.frame(mid=point_month,
                        lower_bound=ci_month[,1],
                        upper_bound=ci_month[,2])

names(estimates_month) <- c("Relative Risk (RR)",
                      "Lower Bound C.I. 95%",
                      "Upper Bound C.I. 95%")

estimates_month %>% 
  rownames_to_column() %>% 
  mutate(rowname=str_remove(rowname,"month"),
         month=factor(rowname, levels=rev(month_levels))) %>%  
  ggplot(aes(month, `Relative Risk (RR)`))+
  geom_point(col="red")+
  geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
                    ymax=`Upper Bound C.I. 95%`))+
  coord_flip()+
  geom_hline(yintercept = 1, linetype="dashed")+
  labs(x="",y="Relative Risk (RR): Month. Reference variable: 2020-12")

Overall we observe some interesting patterns:

  • Countries in Europe have a much lower relative risk compared to other countries.
  • We do observe a reduction in time of the relative risk, with the presence of some peaks probably associated with new variants of COVID-19.

5.3 Sensitivity analysis

# OTHER MODELS ----

## Negative Binomial model - NUMERIC
model.nb <- MASS::glm.nb(deaths~vaccinated+location+month+offset(log(cases)),
              data = df_model)
# summary(model.nb)
# exp(model.nb$coefficients[2]*10)
# exp(confint(model.nb,"vaccinated", level=0.95)*10)

The model with vaccination level as numeric variable shows that a 10% increase in the percentage of vaccinated people results in a reduction of the COVID-19 CFR of 6.56% [C.I. 95%: -8.48%; -4.61%], a similar result that the one obtained in Liang et al. (2021): -7.6% [C.I. 95%: -12.6%; -2.7%].

Next is presented the model with 10 categories of vaccination level and the model only considering cases where the CFR was below 10%, to remove the presence of outliers.

## Poisson model
model_poisson <- glm(deaths~vaccinated_level+location+month+offset(log(cases)),
              data = df_model,
              family = "poisson")
# summary(model_poisson)
# vaccinated variable
# exp(model_poisson$coefficients[2:4])
# exp(confint(model_poisson,2:4, level=0.95))
p_poisson <- f_plot_coef(model_poisson,
                         title_plot = "Outcome as Poisson Distribution")

## 10 levels categorical model ------
model.10levels <- MASS::glm.nb(deaths~vaccinated_level_10+
                            location+month+offset(log(cases)), 
              data = df_model) 
# exp(model.10levels$coefficients[2:10])
# exp(confint(model.10levels,2:4, level=0.95))

p_10 <- f_plot_coef(model.10levels,2:10,
          var_name = "vaccinated_level_10",
          title_plot = "10 Categories for Vaccination Level")
# p_10
## without outliers (below 10% CFR) ------
model_noOutliers <- MASS::glm.nb(deaths~vaccinated_level+
                            location+month+offset(log(cases)), 
              data = filter(df_model,cfr<0.1))
# nobs(model_noOutliers)
# exp(model_noOutliers$coefficients[2:4])
# exp(confint(model_noOutliers,2:4, level=0.95))
p_noOut <- f_plot_coef(model_noOutliers,
                       title_plot = "Only CFR<10% considered")
# p_noOut
grid.arrange(p_10,p_noOut,nrow=1)

We can observe that the confidence interval differ among levels, due to the number of observations in each category. For example, not many countries have achieve a level above 90%, but even for this group the statistical evidence is strong enough to reject the null hypothesis of no effect of vaccines. We observe an expected result: the higher the vaccination rate the greater the reduction in the CFR for COVID-19.

For the model with removed observations, we got similar results as the original model, the greater the vaccination rate the lower the CFR for COVID-19. This model has narrow confidence interval, meaning that is more efficient in rejecting the null hypothesis of no effect.

Next are presented the relative risk of the mixed linear model, using country as a random effect. The confidence intervals for the RR of the original model estimated using bootstrap are also presented.

## country as random effects ------
library(lme4)
# takes forever to compute - option: load it
# model_random <- lme4::glmer.nb(deaths~(1|location)+vaccinated_level+
#                                  month+offset(log(cases)),
#                                data=df_model)
# saveRDS(model_random,"randomModel.rds")
model_random <- read_rds("../randomModel.rds")

# nobs(model_random)
# ranef(model_random)
point <- exp(fixef(model_random)[2:4])
ci <- exp(confint(model_random,3:5,method="Wald", level=0.95))

estimates <- data.frame(mid=point,
                          lower_bound=ci[,1],
                          upper_bound=ci[,2])
  
names(estimates) <- c("Estimated Coefficient",
                      "Lower Bound C.I. 95%",
                      "Upper Bound C.I. 95%")

estimates <- estimates %>% 
  rownames_to_column() %>% 
  rename(Variable=rowname) %>% 
  mutate(Variable=str_replace(Variable,"vaccinated_level",
                              paste0("Vaccinated Level",": ")))
# Plot
p_random <- estimates %>% 
  mutate(Variable=str_remove(Variable,paste0("Vaccinated Level",": "))) %>% 
  ggplot(aes(Variable, `Estimated Coefficient`))+
  geom_point(col="red")+
  geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
                    ymax=`Upper Bound C.I. 95%`))+
  coord_flip()+
  geom_hline(yintercept = 1, linetype="dashed")+
  labs(x="Vaccinated Level", y=paste0("Relative Risk (RR): ",
                             "Vaccinated Level"),
       title = "Mixed Linear Effects Model (Country as Random)")
# Bootstrap
# Source: https://stackoverflow.com/questions/54749641/bootstrapping-with-glm-model
# it took really really long!
# UNCOMMENT TO GENERATE THE OBJECT AGAIN if not, load it


# data structure for results
# nboot <- 1000
# bres <- matrix(NA,
#                nrow=nboot,
#                ncol=4,
#                dimnames=list(rep=seq(nboot),
#                              coef=names(coef(model.nb2))[1:4]))
# # bootstrap
# set.seed(101)
# bootsize <- nrow(df_model)
# df_boot <- df_model
# for (i in seq(nboot)) {
#   bdat <- df_boot[sample(nrow(df_boot),size=bootsize,replace=TRUE),]
#   bfit <- update(model.nb2, data=bdat)  ## refit with new data
#   bres[i,] <- coef(bfit)[1:4]
# }
# # output

# saveRDS(bres,"boot.rds")
bres <- readRDS ("../boot.rds")

boot_estimates <-
  data.frame(mean_est=colMeans(exp(bres)),
           t(apply(exp(bres),2,quantile,c(0.025,0.975))))

# plot
p_boot <- boot_estimates[2:4,] %>% 
  rownames_to_column() %>% 
  mutate(Variable=str_remove(rowname,"vaccinated_level")) %>% 
  ggplot(aes(Variable, mean_est))+
  geom_point(col="red")+
  geom_errorbar(aes(ymin=X2.5.,
                    ymax=X97.5.))+
  coord_flip()+
  geom_hline(yintercept = 1, linetype="dashed")+
  labs(x="Vaccinated Level", y=paste0("Relative Risk (RR): ",
                             "Vaccinated Level"),
       title = "Bootstrap estimates of Original Model")
grid.arrange(p_random,p_boot,nrow=1)

We observe the same pattern as before, a higher vaccination rate is associate it with a lower CFR for COVID-19. We notice small differences in the estimates and in the confidence intervals, but in both cases we can reject the null hypothesis that the vaccination has no effect on the CFR for COVID-19.

6 Discussion

This project seeks to study the effect of vaccination levels in reducing the CFR for COVID-19. The main findings are that higher vaccination rates are associated with lower CFR for COVID-19, making vaccination a good policy strategy to pursue to reduce the death toll of the pandemic. The obtained results are in line with previous studies conducted (Liang et al., 2021), and provide further evidence in the efficacy of vaccination in protecting human lives.

The model suffers from some inherent problems, that future studies could explore more in deep to improve the robustness of the results:

  • Spatial correlation: COVID-19 data presents some spatial autocorrelation, as the disease is infectious and spreader along neighbor countries through immigration. This effect is partially controlled by adding a dummy variable for each country.
  • Differences in COVID-19 cases and deaths detection: The testing capabilities differ from each countries, so the data relative to cases and deaths attributed to COVID-19 is only the best guess of each country, and may differ significantly. Other studies could overcome this issue by analyzing the excess deaths attribute to COVID-19, rather than the identified deaths.
  • Time series autocorrelation: The panel data could suffer from time series auto-correlation, as observations depend on the previous state of each country.
  • Temporal change in characteristics across countries: The dummy variable per country added to the model control for each country individual characteristics, but under the assumption that these traits haven’t suffered major changes in the time period analyzed (2021-12 to 2022-2).
  • Observational unit: The data used correspond to monthly summaries for each country, so all inference about vaccines effectiveness in reducing the CFR for COVID-19 are only valid for this unit of analysis. This study cannot infer anything about the effectiveness of vaccines at the individual level, nor provide an statement about the causal effect of vaccines. Other studies have proved through clinical randomized trials the individual efficacy of the vaccines (Polack et al., 2020)

7 Conclusion

A statistical significant relationship was found, a higher vaccination rate reduce the relative risk of COVID-19. In comparison to the group with 0%-25% vaccinated people, countries in the 25%-50% group have a 19% [C.I. 95%: -27%;-11%] reduction in the CFR for COVID-19, countries in the 50%-75% group a 27% reduction [-35%;-18%] and countries in the 75%-100% group a 37% reduction [-48%;-24%]. Additionally, a a 10% increase in the percentage of vaccinated people results in a reduction of the COVID-19 CFR of 6.56% [C.I. 95%: -8.48%;-4.61%]. Potential limitations of the analysis are due to the potential presence of time and spatial autocorrelation in the data.

There is strong statistical evidence pointing in the association of countries with higher COVID-19 vaccination rates and lower case mortality rate due to COVID-19. This implies that each country with current lower vaccination rates should promote a vaccination strategy in this year to reduce the death burden of the on-going COVID-19 pandemic.

Acknowledgement

I acknowledge the instructor Professor Shizhe Chen and his materials for statistical methods of research. I also acknowledge the following classmates for their valuable feedback and comments during the project development: Yinan Cheng, Shuyu Guo, Kyung Jin Lee, Katherine Cheng, Oscar Rivera and Jedidiah Harwood.

Reference

  • Course Notes STA207 UC Davis Winter Quarter 2022. Professor Shizhe Chen.
  • World Health Organization (WHO). (2022). WHO Coronavirus (COVID-19) Data. See: https://covid19.who.int/info
  • Our World in Data. (2022). Coronavirus (COVID-19) Vaccinations Data. https://ourworldindata.org/covid-vaccinations
  • Liang, L. L., Kuo, H. S., Ho, H. J., & Wu, C. Y. (2021). COVID-19 vaccinations are associated with reduced fatality rates: Evidence from cross-county quasi-experiments. Journal of Global Health, 11.
  • Passarelli-Araujo, H., Pott-Junior, H., Susuki, A. M., Olak, A. S., Pescim, R. R., Tomimatsu, M. F., … & Urbano, M. R. (2022). The impact of COVID-19 vaccination on case fatality rates in a city in Southern Brazil. American Journal of Infection Control.
  • Haldar, A., & Sethi, N. (2020). The effect of country-level factors and government intervention on the incidence of COVID-19. Asian Economics Letters, 1(2), 17804.
  • Florian Hartig (2022). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.4.5. http://florianhartig.github.io/DHARMa/
  • Polack, F. P., Thomas, S. J., Kitchin, N., Absalon, J., Gurtman, A., Lockhart, S., … & Gruber, W. C. (2020). Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. New England Journal of Medicine.

8 Appendix - Additional Analysis/Figures

8.1 Distribution of the outcome variable

The following figure shows the distribution of the CFR per WHO region, with no transformation and with a logarithmic transformation. We can see that applying a logarithmic transformation on the outcome variable help the distribution become more Normally distributed.

df %>% 
  mutate(log_cfr=log(cfr)) %>% 
  pivot_longer(c(cfr, log_cfr), names_to = "key", values_to = "value") %>% 
  mutate(key=str_replace(key,"log_cfr","Log(CFR)"),
         key=str_replace(key,"cfr","CFR")) %>% 
  ggplot(aes(value,fill=region_name))+
  geom_density(alpha=.5)+
  facet_wrap(region_name~key,scales = "free")+
  labs(y="Density",x="Case Fatality Rate COVID-19 [%]",
       fill="Region Name",
       caption = "Cumulated cases and deaths for the period December 2020 to July 2022")

8.2 Density of Vaccination status

The following figure shows the distribution of the vaccination status per country, showing that within each WHO region we observe a bi-modal distribution. Except for Africa region, there are two groups of countries: ones with high vaccination rate and ones with lower vaccination rate.

# density of vaccination
df %>% 
  filter(month=="2022-7") %>% #July 2022 
  ggplot(aes(vaccinated,fill=region_name))+
  geom_density(alpha=.5)+
  facet_wrap(~region_name,scales = "free")+
  labs(y="Density",x="People Fully Vaccinated per 100 people",
       fill="Region Name",
       caption = "Vaccination status at July 2022")

9 Appendix - Model Diagnostics

# Negative Binomial Model - Vaccination as categorical variable
# par(mfrow = c(2, 2))
# plot(model.nb2)
# boot::glm.diag.plots(model.nb2)

Here is the conducted test on dispersion, to see whether the data fitted to the model is more/less dispersed than expected. The test tells us that cannot reject the null hypothesis of no presence of over/under dispersion in the model. This means that the dispersion of the model is in line for what is expected for a negative binomial distribution.

# Dispersion test
# tests if the simulated dispersion is equal to the observed dispersion
a <- testDispersion(model.nb2) #looks good

Next we conduct a QQ-plot for a generalized linear model:

plotQQunif(model.nb2)

We observe that the dispersion of the residuals of the model is as expected for a negative binomial distribution. The outlier test reveals the presence of potential observations that may have an unusual value, which could be problematic for the model. Based on this, an additional model removing observations with values of CFR above 10% was conducted, which gives similar results to the original model.

# testZeroinflation() - tests if there are more zeros in the data than expected from the simulations
# testZeroInflation(model.nb2)

# testTemporalAutocorrelation() - tests for temporal autocorrelation in the residuals
# simulationOutput <- simulateResiduals(fittedModel = model.nb2)
# testTemporalAutocorrelation(simulationOutput,
#                             df_model$location_month)

# testOutliers() - tests if there are more simulation outliers than expectedtestOutliers() - tests if there are more simulation outliers than expected
# a <- testOutliers(model.nb2)
# testOutliers(model_noOutliers)
# influence plot
# outliers to remove
# car::influencePlot(model.nb2)
# 
# 
# df_aux <- df_model[c(-615,-631,-1285,-1332),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
#                             location+month+offset(log(cases)), 
#               data = df_aux) 
# car::influencePlot(model.nb3)
# 
# df_aux <- df_aux[c(-40,-456,-766,-822,-1508),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
#                             location+month+offset(log(cases)), 
#               data = df_aux) 
# car::influencePlot(model.nb3)
# 
# df_aux <- df_aux[c(-327,-835,-859,-931,-1546),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
#                             location+month+offset(log(cases)), 
#               data = df_aux) 
# car::influencePlot(model.nb3)
# 
# df_aux <- df_aux[c(-177,-627,-1040,-1274,-1355,-1439),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
#                             location+month+offset(log(cases)), 
#               data = df_aux) 
# car::influencePlot(model.nb3)

Appendix - Session info

sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lme4_1.1-28     Matrix_1.4-0    DHARMa_0.4.5    flextable_0.7.0
##  [5] skimr_2.1.3     plotly_4.10.0   gridExtra_2.3   lubridate_1.8.0
##  [9] scales_1.1.1    forcats_0.5.1   stringr_1.4.0   dplyr_1.0.8    
## [13] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.6   
## [17] ggplot2_3.3.5   tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-155      fs_1.5.2          bit64_4.0.5       httr_1.4.2       
##  [5] repr_1.1.4        tools_4.1.3       backports_1.4.1   bslib_0.3.1      
##  [9] utf8_1.2.2        R6_2.5.1          DBI_1.1.2         lazyeval_0.2.2   
## [13] colorspace_2.0-3  withr_2.5.0       tidyselect_1.1.2  curl_4.3.2       
## [17] bit_4.0.4         compiler_4.1.3    cli_3.2.0         rvest_1.0.2      
## [21] xml2_1.3.3        officer_0.4.1     labeling_0.4.2    sass_0.4.0       
## [25] lmtest_0.9-39     systemfonts_1.0.4 digest_0.6.27     minqa_1.2.4      
## [29] rmarkdown_2.13    base64enc_0.1-3   pkgconfig_2.0.3   htmltools_0.5.2  
## [33] highr_0.9         dbplyr_2.1.1      fastmap_1.1.0     htmlwidgets_1.5.4
## [37] rlang_1.0.2       readxl_1.3.1      rstudioapi_0.13   farver_2.1.0     
## [41] jquerylib_0.1.4   generics_0.1.2    zoo_1.8-9         jsonlite_1.8.0   
## [45] crosstalk_1.2.0   vroom_1.5.7       zip_2.2.0         magrittr_2.0.1   
## [49] Rcpp_1.0.8.3      munsell_0.5.0     fansi_1.0.2       gdtools_0.2.4    
## [53] lifecycle_1.0.1   stringi_1.7.6     yaml_2.3.5        MASS_7.3-55      
## [57] grid_4.1.3        parallel_4.1.3    crayon_1.5.0      lattice_0.20-45  
## [61] haven_2.4.3       splines_4.1.3     hms_1.1.1         knitr_1.37       
## [65] pillar_1.7.0      uuid_1.0-4        boot_1.3-28       reprex_2.0.1     
## [69] glue_1.6.2        evaluate_0.15     data.table_1.14.2 modelr_0.1.8     
## [73] vctrs_0.3.8       nloptr_2.0.0      tzdb_0.2.0        cellranger_1.1.0 
## [77] gtable_0.3.0      assertthat_0.2.1  xfun_0.30         broom_0.7.12     
## [81] viridisLite_0.4.0 ellipsis_0.3.2    gap_1.2.3-1

Appendix - R Code used

# knitr::opts_chunk$set(fig.pos = 'H')
# knitr::opts_chunk$set(fig.pos = 'H', message=F, echo = T, warning=F)
knitr::opts_chunk$set(fig.pos = 'H', 
                      fig.width=12, fig.height=8,
                      message=F, echo = T, warning=F)
# libraries
library(tidyverse) # data manipulation
library(scales) 
library(lubridate) #time format
library(gridExtra) # plot manipulation
library(plotly) # interactive plots
library(skimr) # summary statistics
library(flextable) #table format for hhtml
library(DHARMa) # Simulations test for GLM
theme_set(theme_bw(16)+theme(panel.grid.major = element_blank()))
# COVID-19 Data -----
# load WHO COVID-19 data
covid <- read_csv("https://covid19.who.int/WHO-COVID-19-global-data.csv")

# Data wrangling (preparation for analysis)

# shift deaths two weeks early
covid <- covid %>% 
  group_by(Country) %>% 
  mutate(deaths_shift=lead(New_deaths,14,
                          order_by = Date_reported)) %>% 
  ungroup()


# filter to remove March 2022 - Updated to August
covid <- covid %>% filter(Date_reported<"2022-08-01") 

## Create time factor variable, splitting years in half
halfYear_levels <- c("2020-1","2020-2","2021-1",
                     "2021-2","2022-1","2022-2")
month_levels <- c("2020-12","2021-1","2021-2","2021-3",
                  "2021-4","2021-5","2021-6","2021-7","2021-8",
                  "2021-9","2021-10","2021-11","2021-12",
                  "2022-1","2022-2","2022-3","2022-4","2022-5",
                  "2022-6","2022-7")

covid <- covid %>%  
  mutate(period=case_when(
    Date_reported < as.Date("2020-07-01") ~ "2020-1",
    Date_reported < as.Date("2021-01-01") ~ "2020-2",
    Date_reported < as.Date("2021-07-01") ~ "2021-1",
    Date_reported < as.Date("2022-01-01") ~ "2021-2",
    Date_reported < as.Date("2022-07-01") ~ "2022-1",
    Date_reported < as.Date("2023-01-01") ~ "2022-2",
    T ~ "other") %>% factor(levels=halfYear_levels),
    month=paste(year(Date_reported),month(Date_reported),sep="-") %>% 
      factor(levels =month_levels))


# clean data: remove negative cases and deaths
covid <- covid %>% 
  mutate(New_deaths=deaths_shift) %>% 
  filter(New_cases>=0 & New_deaths>=0)

## need to summarize COVID data to periods of time: MONTHS
covid_period <- covid %>% 
  group_by(WHO_region,Country,month) %>% 
  summarise(cases=sum(New_cases),
            deaths=sum(New_deaths)) %>% ungroup() %>% 
  filter(cases>99 & deaths>9) %>%  # filter to remove unreliable periods
  mutate(cfr=deaths/cases)  # we calculate CFR for each given month

# VACCINATION DATA -----
## Load vaccination data
vaccination <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv")

# Filter to remove March 2022 - Updated to August 2022
vaccination <- vaccination %>% 
  filter(date<"2022-08-01")

# We summarized vaccination data based on the a time period: MONTH
## Create time factor variable, splitting years in half
vaccination <- vaccination %>% 
  mutate(period=case_when(
    date < as.Date("2020-07-01") ~ "2020-1",
    date < as.Date("2021-01-01") ~ "2020-2",
    date < as.Date("2021-07-01") ~ "2021-1",
    date < as.Date("2022-01-01") ~ "2021-2",
    date < as.Date("2022-07-01") ~ "2022-1",
    date < as.Date("2023-01-01") ~ "2022-2",
    T ~ "other") %>% factor(levels=halfYear_levels),
    month=paste(year(date),month(date),sep="-") %>% factor(levels=month_levels))

# take the average over the period: NA is 0
vaccination_period <- vaccination %>% 
  mutate(vaccinated=replace_na(people_fully_vaccinated_per_hundred,0)) %>% 
  group_by(location,month) %>% 
  summarise(vaccinated=mean(vaccinated,na.rm=T)) %>% ungroup() 

# JOIN DATA -----

## Change countries name to increase matching (manually, based on visual inspection)
covid <- covid %>% 
  mutate(Country=case_when(
    Country=="Bolivia (Plurinational State of)" ~ "Bolivia",
    Country=="Bonaire" ~ "Bonaire Sint Eustatius and Saba",
    Country=="Brunei Darussalam" ~ "Brunei",
    Country=="Cabo Verde" ~ "Cape Verde",
    Country=="Côte d’Ivoire" ~ "Cote d'Ivoire",
    # str_detect(Country,"C.te d.Ivoire") ~ "Cote d'Ivoire", 
    Country=="Curaçao" ~ "Curacao",
    # str_detect(Country,"Cura.ao") ~ "Curacao",
    Country=="Democratic Republic of the Congo" ~ "Democratic Republic of Congo",
    Country=="Falkland Islands (Malvinas)" ~ "Falkland Islands",
    Country=="Faroe Islands" ~ "Faeroe Islands",
    Country=="Iran (Islamic Republic of)" ~ "Iran",
    Country=="Kosovo[1]" ~ "Kosovo",
    Country=="Lao People's Democratic Republic" ~ "Laos",
    Country=="occupied Palestinian territory, including east Jerusalem" ~ "Palestine",
    Country=="Pitcairn Islands" ~ "Pitcairn",
    Country=="Republic of Korea" ~ "South Korea",
    Country=="Republic of Moldova" ~ "Moldova",
    Country=="Russian Federation" ~ "Russia",
    Country=="Sint Maarten" ~ "Sint Maarten (Dutch part)",
    Country=="Syrian Arab Republic" ~ "Syria",
    Country=="The United Kingdom" ~ "United Kingdom",
    Country=="Timor-Leste" ~ "Timor",
    Country=="United Republic of Tanzania" ~ "Tanzania",
    Country=="United States of America" ~ "United States",
    Country=="Venezuela (Bolivarian Republic of)" ~ "Venezuela",
    Country=="Viet Nam" ~ "Vietnam",
    T ~ Country))

# Join countries from both data sources
# countries_covid <- unique(covid$Country)
# countries_vaccine <- unique(vaccination$location)
# 
# countries_vaccine <- data.frame(countries_vaccine=countries_vaccine,
#                              x=1)
# countries_covid <- data.frame(countries_covid=countries_covid,
#                               y=1)

# are the codes similar?
# countries_vaccine %>%
#   left_join(countries_covid,by=c("countries_vaccine"="countries_covid")) %>%
#   pull(y) %>% sum(na.rm=T)
# 215 countries match


# join based on location - I join it to the Vaccination, so I can get months with vaccine information for each country
df <- vaccination_period %>% 
  left_join(covid_period, by=c("location"="Country",
                               "month")) %>% 
  filter(cases>=0 & deaths>=0 & !(is.na(vaccinated)))

## More convenient labels for region
df <- df %>% 
  mutate(region_name=case_when(
    WHO_region=="EMRO" ~ "Eastern Mediterranean",
    WHO_region=="EURO" ~ "Europe",
    WHO_region=="AFRO" ~ "Africa",
    WHO_region=="AMRO" ~ "Americas",
    WHO_region=="WPRO" ~ "Western Pacific",
    WHO_region=="SEARO" ~ "South-East Asia") %>%
      factor(levels=c("Americas","Europe","Eastern Mediterranean",
                      "Western Pacific","South-East Asia","Africa")),
  location=as.factor(location))

# Vaccination level factor variable
df <- df %>% 
  mutate(vaccinated_level=case_when(
    vaccinated<25 ~ "0%-25%",
    vaccinated<50 ~ "25%-50%",
    vaccinated<75 ~ "50%-75%",
    vaccinated<100 ~ "75%-100%",
    T ~"other") %>% factor(levels=c("0%-25%","25%-50%",
                                    "50%-75%","75%-100%")),
    vaccinated_level_10=case_when(
    vaccinated<10 ~ "0%-10%",
    vaccinated<20 ~ "10%-20%",
    vaccinated<30 ~ "20%-30%",
    vaccinated<40 ~ "30%-40%",
    vaccinated<50 ~ "40%-50%",
    vaccinated<60 ~ "50%-60%",
    vaccinated<70 ~ "60%-70%",
    vaccinated<80 ~ "70%-80%",
    vaccinated<90 ~ "80%-90%",
    vaccinated<101 ~ "90%-100%",
    T ~"other") %>% factor(levels=c("0%-10%","10%-20%","20%-30%",
                                    "30%-40%","40%-50%","50%-60%",
                                    "60%-70%","70%-80%","80%-90%",
                                    "90%-100%")))
# df$vaccinated_level_10 %>% table()
# function to estimate summary statistics:
f.sum <- function(x){
  return(
    data.frame(
     Mean=mean(x,na.rm=T),
     sd=sd(x,na.rm=T),
     Min=min(x,na.rm=T),
     Median=quantile(x,0.5,na.rm=T),
     Max=max(x,na.rm = T)
    )
  )
}

# average number of months per county:
avg_country_month <- df %>% 
  group_by(location) %>% tally()

# Table with summary statistics
rbind(`Number of months per country`=f.sum(avg_country_month$n),
      `COVID-19 Cases per Month`=f.sum(df$cases),
      `COVID-19 Deaths per Month`=f.sum(df$deaths),
      `Case Fatality Rate (CFR) %`=f.sum(df$cfr*100),
      `People Fully Vaccinated (%)`=f.sum(df$vaccinated)) %>% 
  rownames_to_column() %>% rename(Metric=rowname) %>% 
  flextable() %>% autofit() %>% 
  colformat_double(digits=2) %>% 
  colformat_double(i=2:3,digits=0) %>% 
  bold(part = "header") %>% 
  set_caption(paste0(
    "Summary Metrics: Monthly observations per country. n = ",nrow(df))) %>% 
  footnote(i=1,j=1,
           as_paragraph(paste0(length(unique(df$location)),
                               " Countries considered")))

rm(avg_country_month)
#### Time series of COVID-19 Cases and deaths
p_region <- df %>% 
  pivot_longer(c(cases,deaths), names_to = "key", values_to = "value") %>% 
  ggplot(aes(month,value))+
  # geom_line()+
  geom_boxplot()+
  facet_wrap(~key,ncol=1,scales = "free_y")+
  labs(x="Date",y="")+
  scale_y_continuous(labels=function(x) format(x, 
                                               big.mark = " ", 
                                               scientific = FALSE),
                     trans = "log10")
# p_region


p_cfr_time <- df %>% 
  ggplot(aes(month,cfr))+
  geom_boxplot()+
   scale_y_continuous(labels = scales::percent,
                      limits = c(0,0.1))+
  labs(x="Date",y="Case Fatality Rate COVID-19 [%]")+
  theme(axis.text.x = element_text(angle =-45,vjust=-0.5))
# p_cfr_time

p_vacc_time <- df %>% 
  ggplot(aes(month,vaccinated))+
  geom_boxplot()+
  labs(x="Date",y="People Fully Vaccinated per 100 people")+
  theme(axis.text.x = element_text(angle =-45,vjust=-0.5))
# p_vacc_time

## combine graphs
# grid.arrange(p_cfr_time,p_vacc_time,ncol=1)
p_cfr_all<- df %>% 
  filter(WHO_region!="Other") %>% 
  ggplot(aes(region_name,cfr))+
  geom_jitter(alpha=0.5,aes(text=paste0("Location: ",location,
                                        "\n Month: ",month)))+
  geom_boxplot(alpha=0.5,outlier.alpha = 0)+
  coord_flip()+
  scale_y_continuous(labels = scales::percent, limits=c(0,0.1))+
  labs(x="WHO Region",y="Case Fatality Rate COVID-19 [%]",
       caption = "Cumulated cases and deaths for the period December 2020 to July 2022 \n Only CFR below 10% are shown")

ggplotly(p_cfr_all)

# percentage above 10%
# sum(nrow(filter(df,cfr>0.1)))/nrow(df)*100
# Time series CFR - NEED TO CONSIDER WEIGHTED AVERAGE
p_cfr_time <- df %>% 
  group_by(month,region_name) %>%
  summarize(cfr=mean(cfr,na.rm=T)) %>%
  ggplot(aes(month,cfr,
             col=region_name,
             group=region_name
             ))+
  geom_line(size=1.5)+
  # geom_boxplot()+
  scale_y_continuous(labels = scales::percent,
                     limits = c(0,0.1),
                     # breaks=c(0,0.05,0.1)
                     )+
  theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
  labs(x="",y="Case Fatality Rate COVID-19 [%]",col="WHO Region")

ggplotly(p_cfr_time)

p_vacc_all <- df %>% 
  filter(month=="2022-7") %>% #July 2022 
  ggplot(aes(region_name,vaccinated))+
  geom_boxplot(alpha=0.5,outlier.alpha = 0)+
  geom_jitter(alpha=0.5, size=2,
              aes(text=paste0("Location: ",location)))+
  coord_flip()+
  labs(x="WHO Region",y="People Fully Vaccinated [%]",
       caption = "Vaccination status at July 2022")

ggplotly(p_vacc_all)
# time series of vaccination - NEED TO CONSIDER WEIGHTED AVERAGE
# df %>% 
#   group_by(month,region_name) %>%
#   summarize(vaccinated=mean(vaccinated,na.rm=T)) %>%
#   ggplot(aes(month,vaccinated,
#              col=region_name,
#              group=region_name
#              ))+
#   geom_line()+
#   # geom_boxplot()+
#   theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
#   labs(x="",y="People Fully Vaccinated per 100 people",col="WHO Region")
p_vac_cfr_box <- df %>% 
  filter(!(is.na(WHO_region))) %>% 
  ggplot(aes(region_name,cfr,col=vaccinated_level))+
  geom_boxplot()+
  scale_y_continuous(labels = scales::percent,
                     limits = c(0,0.1)
  )+
  coord_flip()+
  # theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
  labs(col="Percentage of People \n Fully Vaccinated [%]",
       x="WHO Region",
       y="Case Fatality Rate COVID-19 [%]",
       caption="Each point represents a country summary statistic for each month")

# ggplotly(p_vac_cfr_box)
p_vac_cfr_box
# Model
df_model <- df %>% 
  select(deaths,cfr,cases,vaccinated,vaccinated_level,
  vaccinated_level_10,location,month) %>% 
  mutate(location_month=paste0(month,location)) %>% 
  na.omit()

# Model with categorical vaccinated level
model.nb2 <- MASS::glm.nb(deaths~vaccinated_level+
                            location+month+offset(log(cases)), 
              data = df_model) 
# summary(model.nb2)
point_vac <- exp(model.nb2$coefficients[2:4])
ci_vac <- exp(confint(model.nb2,2:4, level=0.95))

# LR test
model.nb1 <- MASS::glm.nb(deaths~location+month+offset(log(cases)), 
              data = df_model) 
test_lr <- lmtest::lrtest(model.nb1,model.nb2)

estimates <- data.frame(mid=point_vac,
                        lower_bound=ci_vac[,1],
                        upper_bound=ci_vac[,2])
names(estimates) <- c("Estimated Coefficient",
                      "Lower Bound C.I. 95%",
                      "Upper Bound C.I. 95%")

estimates <- estimates %>% 
  rownames_to_column() %>% 
  rename(Variable=rowname) %>% 
  mutate(Variable=str_replace(Variable,"vaccinated_level",
                              "Vaccination Level: "))


estimates %>%
  rename(RR=`Estimated Coefficient`) %>% 
  flextable() %>% 
  autofit() %>% 
  colformat_double(digits=2) %>% 
  set_caption("Relative Risk (RR): Vaccination Level")

estimates %>% 
  mutate(Variable=str_remove(Variable,"Vaccination Level: ")) %>% 
  ggplot(aes(Variable, `Estimated Coefficient`))+
  geom_point(col="red")+
  geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
                    ymax=`Upper Bound C.I. 95%`))+
  coord_flip()+
  geom_hline(yintercept = 1, linetype="dashed")+
  labs(x="Vaccination Level",y="Relative Risk (RR): Vaccination Level")
rm(estimates)


# Function to plot estimated coefficients along their error bar
f_plot_coef <- function(mod,
                        range=2:4,
                      var_name="vaccinated_level",
                      var_title="Vaccination Level",
                      title_plot=""){
  
  point <- exp(mod$coefficients[range])
  ci <- exp(confint(mod,range, level=0.95))
  estimates <- data.frame(mid=point,
                          lower_bound=ci[,1],
                          upper_bound=ci[,2])
  
  names(estimates) <- c("Estimated Coefficient",
                        "Lower Bound C.I. 95%",
                        "Upper Bound C.I. 95%")
  
  estimates <- estimates %>% 
    rownames_to_column() %>% 
    rename(Variable=rowname) %>% 
    mutate(Variable=str_replace(Variable,var_name,
                                paste0(var_title,": ")))
  
  # Plot
  estimates %>% 
    mutate(Variable=str_remove(Variable,paste0(var_title,": "))) %>% 
    ggplot(aes(Variable, `Estimated Coefficient`))+
    geom_point(col="red")+
    geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
                      ymax=`Upper Bound C.I. 95%`))+
    coord_flip()+
    geom_hline(yintercept = 1, linetype="dashed")+
    labs(x=var_title, y=paste0("Relative Risk (RR): ",
                               var_title),title = title_plot)
}
location_coefficients <- data.frame(coef=exp(model.nb2$coefficients)) %>% 
  rownames_to_column() %>% 
  filter(str_detect(rowname,"location")) %>% 
  mutate(location=str_remove(rowname,"location"))

# Plot with WHO region
country_region <- df %>% 
  group_by(region_name,location) %>% tally() %>% ungroup()

location_coefficients %>% 
  left_join(country_region) %>% 
  ggplot(aes(region_name,coef,col=region_name))+
  geom_jitter(alpha=.5)+
  geom_boxplot()+
  geom_hline(yintercept = 1, linetype="dashed")+
  coord_flip()+
  guides(col=F)+
  labs(x="WHO Region",col="",y="Relative Risk (RR): Country")
size_coef <- length(model.nb2$coefficients)

# get coefficients for month
point_month <- exp(model.nb2$coefficients[(size_coef-13):(size_coef)])
ci_month <- exp(confint(model.nb2,(size_coef-13):(size_coef), level=0.95))

estimates_month <- data.frame(mid=point_month,
                        lower_bound=ci_month[,1],
                        upper_bound=ci_month[,2])

names(estimates_month) <- c("Relative Risk (RR)",
                      "Lower Bound C.I. 95%",
                      "Upper Bound C.I. 95%")

estimates_month %>% 
  rownames_to_column() %>% 
  mutate(rowname=str_remove(rowname,"month"),
         month=factor(rowname, levels=rev(month_levels))) %>%  
  ggplot(aes(month, `Relative Risk (RR)`))+
  geom_point(col="red")+
  geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
                    ymax=`Upper Bound C.I. 95%`))+
  coord_flip()+
  geom_hline(yintercept = 1, linetype="dashed")+
  labs(x="",y="Relative Risk (RR): Month. Reference variable: 2020-12")
# OTHER MODELS ----

## Negative Binomial model - NUMERIC
model.nb <- MASS::glm.nb(deaths~vaccinated+location+month+offset(log(cases)),
              data = df_model)
# summary(model.nb)
# exp(model.nb$coefficients[2]*10)
# exp(confint(model.nb,"vaccinated", level=0.95)*10)
## Poisson model
model_poisson <- glm(deaths~vaccinated_level+location+month+offset(log(cases)),
              data = df_model,
              family = "poisson")
# summary(model_poisson)
# vaccinated variable
# exp(model_poisson$coefficients[2:4])
# exp(confint(model_poisson,2:4, level=0.95))
p_poisson <- f_plot_coef(model_poisson,
                         title_plot = "Outcome as Poisson Distribution")

## 10 levels categorical model ------
model.10levels <- MASS::glm.nb(deaths~vaccinated_level_10+
                            location+month+offset(log(cases)), 
              data = df_model) 
# exp(model.10levels$coefficients[2:10])
# exp(confint(model.10levels,2:4, level=0.95))

p_10 <- f_plot_coef(model.10levels,2:10,
          var_name = "vaccinated_level_10",
          title_plot = "10 Categories for Vaccination Level")
# p_10
## without outliers (below 10% CFR) ------
model_noOutliers <- MASS::glm.nb(deaths~vaccinated_level+
                            location+month+offset(log(cases)), 
              data = filter(df_model,cfr<0.1))
# nobs(model_noOutliers)
# exp(model_noOutliers$coefficients[2:4])
# exp(confint(model_noOutliers,2:4, level=0.95))
p_noOut <- f_plot_coef(model_noOutliers,
                       title_plot = "Only CFR<10% considered")
# p_noOut
grid.arrange(p_10,p_noOut,nrow=1)
## country as random effects ------
library(lme4)
# takes forever to compute - option: load it
# model_random <- lme4::glmer.nb(deaths~(1|location)+vaccinated_level+
#                                  month+offset(log(cases)),
#                                data=df_model)
# saveRDS(model_random,"randomModel.rds")
model_random <- read_rds("../randomModel.rds")

# nobs(model_random)
# ranef(model_random)
point <- exp(fixef(model_random)[2:4])
ci <- exp(confint(model_random,3:5,method="Wald", level=0.95))

estimates <- data.frame(mid=point,
                          lower_bound=ci[,1],
                          upper_bound=ci[,2])
  
names(estimates) <- c("Estimated Coefficient",
                      "Lower Bound C.I. 95%",
                      "Upper Bound C.I. 95%")

estimates <- estimates %>% 
  rownames_to_column() %>% 
  rename(Variable=rowname) %>% 
  mutate(Variable=str_replace(Variable,"vaccinated_level",
                              paste0("Vaccinated Level",": ")))
# Plot
p_random <- estimates %>% 
  mutate(Variable=str_remove(Variable,paste0("Vaccinated Level",": "))) %>% 
  ggplot(aes(Variable, `Estimated Coefficient`))+
  geom_point(col="red")+
  geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
                    ymax=`Upper Bound C.I. 95%`))+
  coord_flip()+
  geom_hline(yintercept = 1, linetype="dashed")+
  labs(x="Vaccinated Level", y=paste0("Relative Risk (RR): ",
                             "Vaccinated Level"),
       title = "Mixed Linear Effects Model (Country as Random)")
# Bootstrap
# Source: https://stackoverflow.com/questions/54749641/bootstrapping-with-glm-model
# it took really really long!
# UNCOMMENT TO GENERATE THE OBJECT AGAIN if not, load it


# data structure for results
# nboot <- 1000
# bres <- matrix(NA,
#                nrow=nboot,
#                ncol=4,
#                dimnames=list(rep=seq(nboot),
#                              coef=names(coef(model.nb2))[1:4]))
# # bootstrap
# set.seed(101)
# bootsize <- nrow(df_model)
# df_boot <- df_model
# for (i in seq(nboot)) {
#   bdat <- df_boot[sample(nrow(df_boot),size=bootsize,replace=TRUE),]
#   bfit <- update(model.nb2, data=bdat)  ## refit with new data
#   bres[i,] <- coef(bfit)[1:4]
# }
# # output

# saveRDS(bres,"boot.rds")
bres <- readRDS ("../boot.rds")

boot_estimates <-
  data.frame(mean_est=colMeans(exp(bres)),
           t(apply(exp(bres),2,quantile,c(0.025,0.975))))

# plot
p_boot <- boot_estimates[2:4,] %>% 
  rownames_to_column() %>% 
  mutate(Variable=str_remove(rowname,"vaccinated_level")) %>% 
  ggplot(aes(Variable, mean_est))+
  geom_point(col="red")+
  geom_errorbar(aes(ymin=X2.5.,
                    ymax=X97.5.))+
  coord_flip()+
  geom_hline(yintercept = 1, linetype="dashed")+
  labs(x="Vaccinated Level", y=paste0("Relative Risk (RR): ",
                             "Vaccinated Level"),
       title = "Bootstrap estimates of Original Model")
grid.arrange(p_random,p_boot,nrow=1)
df %>% 
  mutate(log_cfr=log(cfr)) %>% 
  pivot_longer(c(cfr, log_cfr), names_to = "key", values_to = "value") %>% 
  mutate(key=str_replace(key,"log_cfr","Log(CFR)"),
         key=str_replace(key,"cfr","CFR")) %>% 
  ggplot(aes(value,fill=region_name))+
  geom_density(alpha=.5)+
  facet_wrap(region_name~key,scales = "free")+
  labs(y="Density",x="Case Fatality Rate COVID-19 [%]",
       fill="Region Name",
       caption = "Cumulated cases and deaths for the period December 2020 to July 2022")
# density of vaccination
df %>% 
  filter(month=="2022-7") %>% #July 2022 
  ggplot(aes(vaccinated,fill=region_name))+
  geom_density(alpha=.5)+
  facet_wrap(~region_name,scales = "free")+
  labs(y="Density",x="People Fully Vaccinated per 100 people",
       fill="Region Name",
       caption = "Vaccination status at July 2022")
# Negative Binomial Model - Vaccination as categorical variable
# par(mfrow = c(2, 2))
# plot(model.nb2)
# boot::glm.diag.plots(model.nb2)
# Dispersion test
# tests if the simulated dispersion is equal to the observed dispersion
a <- testDispersion(model.nb2) #looks good
plotQQunif(model.nb2)
# testZeroinflation() - tests if there are more zeros in the data than expected from the simulations
# testZeroInflation(model.nb2)

# testTemporalAutocorrelation() - tests for temporal autocorrelation in the residuals
# simulationOutput <- simulateResiduals(fittedModel = model.nb2)
# testTemporalAutocorrelation(simulationOutput,
#                             df_model$location_month)

# testOutliers() - tests if there are more simulation outliers than expectedtestOutliers() - tests if there are more simulation outliers than expected
# a <- testOutliers(model.nb2)
# testOutliers(model_noOutliers)

# influence plot
# outliers to remove
# car::influencePlot(model.nb2)
# 
# 
# df_aux <- df_model[c(-615,-631,-1285,-1332),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
#                             location+month+offset(log(cases)), 
#               data = df_aux) 
# car::influencePlot(model.nb3)
# 
# df_aux <- df_aux[c(-40,-456,-766,-822,-1508),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
#                             location+month+offset(log(cases)), 
#               data = df_aux) 
# car::influencePlot(model.nb3)
# 
# df_aux <- df_aux[c(-327,-835,-859,-931,-1546),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
#                             location+month+offset(log(cases)), 
#               data = df_aux) 
# car::influencePlot(model.nb3)
# 
# df_aux <- df_aux[c(-177,-627,-1040,-1274,-1355,-1439),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
#                             location+month+offset(log(cases)), 
#               data = df_aux) 
# car::influencePlot(model.nb3)
sessionInfo()